 processing element


HW/SW Co-design of a PCM/PWM converter: a System Level Approach based in the SpecC Methodology

arXiv.org Artificial Intelligence

Abstract--[Original work from 2005; formatting revised in 2025, with no changes to the results.] We present a case study applying the SpecC methodology within a system-level hardware/software co-design flow to a PCM-to-PWM converter, the core of a Class-D audio amplifier. The converter was modeled and explored with the SpecC methodology to derive an HW/SW partition. Using system-level estimates and fast functional simulation, we evaluated mappings that meet real-time constraints while reducing the estimated cost of an all-hardware solution and avoiding the expense of a purely software implementation on a high-end processor. Despite the design's moderate complexity, the results underline the value of system-level co-design for early architectural insight, rapid validation, and actionable cost/performance trade-offs. The recent requirements of the semiconductor industry have driven the evolution of design methodologies in order to close the design gap identified by the 1999 Roadmap [1]. Today's designs have very high integration density and complex functionality to implement, which creates the need for a higher level of abstraction, the so-called System Level.
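As a rough illustration of what a PCM-to-PWM converter computes, the sketch below maps one signed PCM sample to a pulse width measured in clock ticks. The function name `pcm_to_pwm_ticks`, the 16-bit sample width, the 256-tick carrier period, and the linear (uniform-sampling) mapping are all assumptions made for illustration; the abstract does not state the converter's algorithm or parameters.

```python
def pcm_to_pwm_ticks(sample: int, bits: int = 16, period_ticks: int = 256) -> int:
    """Map a signed PCM sample to the number of 'high' ticks in one PWM period.

    Hypothetical uniform-sampling scheme: the sample is shifted into an
    unsigned range and scaled linearly onto the carrier period.
    """
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    if not lo <= sample <= hi:
        raise ValueError(f"sample {sample} is outside the signed {bits}-bit range")
    unsigned = sample - lo                      # 0 .. 2**bits - 1
    return (unsigned * period_ticks) >> bits    # 0 .. period_ticks - 1


if __name__ == "__main__":
    for s in (-32768, 0, 32767):
        print(s, pcm_to_pwm_ticks(s))
```

In this sketch a mid-scale sample yields a 50% duty cycle, and the shift-based scaling uses only integer operations, which is the kind of step that could be placed on either side of an HW/SW partition.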


CGRA4ML: A Framework to Implement Modern Neural Networks for Scientific Edge Computing

arXiv.org Artificial Intelligence

Scientific edge computing increasingly relies on hardware-accelerated neural networks to implement complex, near-sensor processing at extremely high throughputs and low latencies. Existing frameworks like HLS4ML are effective for smaller models, but struggle with larger, modern neural networks due to their requirement of spatially implementing the neural network layers and storing all weights in on-chip memory. CGRA4ML is an open-source, modular framework designed to bridge the gap between neural network model complexity and extreme performance requirements. CGRA4ML extends the capabilities of HLS4ML by allowing off-chip data storage and supporting a broader range of neural network architectures, including models like ResNet, PointNet, and transformers. Unlike HLS4ML, CGRA4ML generates SystemVerilog RTL, making it more suitable for targeting ASIC and FPGA design flows. We demonstrate the effectiveness of our framework by implementing and scaling larger models that were previously unattainable with HLS4ML, showcasing its adaptability and efficiency in handling complex computations. CGRA4ML also introduces an extensive verification framework, with a generated runtime firmware that enables its integration into different SoC platforms. CGRA4ML's minimal and modular infrastructure of Python API, SystemVerilog hardware, Tcl toolflows, and C runtime, facilitates easy integration and experimentation, allowing scientists to focus on innovation rather than the intricacies of hardware design and optimization.


CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference

arXiv.org Artificial Intelligence

Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), introduces considerable challenges. These challenges stem primarily from the non-linear calculations and high computational and memory demands of ViTs. This paper introduces CHOSEN, a software-hardware co-design framework that addresses these challenges and automates ViT deployment on FPGAs in order to maximize performance. Our framework is built upon three fundamental contributions: a multi-kernel design that maximizes bandwidth, mainly by exploiting multiple DDR memory banks; approximate non-linear functions that exhibit minimal accuracy degradation and use the available logic blocks on the FPGA efficiently; and an efficient compiler that maximizes the performance and memory efficiency of the computing kernels through a novel design space exploration algorithm that searches for the hardware configuration with the best throughput and latency. Compared to the state-of-the-art ViT accelerators, CHOSEN achieves a 1.5x and 1.42x improvement in the throughput on the DeiT-S and DeiT-B models.


Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

arXiv.org Artificial Intelligence

Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, like graphical processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, the current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer during run-time. Our experiments thoroughly test the viability of the Flex-TPU comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75x compared to conventional TPU, with only minor area and power overheads.


Ev-Edge: Efficient Execution of Event-based Vision Algorithms on Commodity Edge Platforms

arXiv.org Artificial Intelligence

Event cameras have emerged as a promising sensing modality for autonomous navigation systems, owing to their high temporal resolution, high dynamic range and negligible motion blur. To process the asynchronous temporal event streams from such sensors, recent research has shown that a mix of Artificial Neural Networks (ANNs), Spiking Neural Networks (SNNs) as well as hybrid SNN-ANN algorithms are necessary to achieve high accuracies across a range of perception tasks. However, we observe that executing such workloads on commodity edge platforms, which feature heterogeneous processing elements such as CPUs, GPUs and neural accelerators, results in inferior performance. This is due to the mismatch between the irregular nature of event streams and diverse characteristics of algorithms on the one hand and the underlying hardware platform on the other. We propose Ev-Edge, a framework that contains three key optimizations to boost the performance of event-based vision systems on edge platforms: (1) an Event2Sparse Frame converter directly transforms raw event streams into sparse frames, enabling the use of sparse libraries with minimal encoding overheads; (2) a Dynamic Sparse Frame Aggregator merges sparse frames at runtime by trading off the temporal granularity of events and computational demand, thereby improving hardware utilization; and (3) a Network Mapper maps concurrently executing tasks to different processing elements while also selecting layer precision by considering both compute and communication overheads. On several state-of-the-art networks for a range of autonomous navigation tasks, Ev-Edge achieves 1.28x-2.05x improvements in latency and 1.23x-2.15x in energy over an all-GPU implementation on the NVIDIA Jetson Xavier AGX platform for single-task execution scenarios. Ev-Edge also achieves 1.43x-1.81x latency improvements over round-robin scheduling methods in multi-task execution scenarios.
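To make the idea of turning a raw event stream into a sparse frame concrete, here is a minimal sketch. It is not Ev-Edge's converter: the event tuple layout `(x, y, t, polarity)`, the function name `events_to_sparse_frame`, and the choice of a coordinate-list (COO) output with signed event counts are assumptions made for illustration.

```python
from collections import defaultdict


def events_to_sparse_frame(events, width, height):
    """Accumulate (x, y, t, polarity) events into a COO-style sparse frame.

    Returns (rows, cols, values) for non-zero pixels only, sorted by
    (row, col). Each ON event adds +1 and each OFF event adds -1.
    """
    acc = defaultdict(int)
    for x, y, _t, polarity in events:
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"event at ({x}, {y}) is outside the sensor area")
        acc[(y, x)] += 1 if polarity else -1
    # Drop pixels whose ON and OFF events cancelled out.
    items = sorted((key, val) for key, val in acc.items() if val != 0)
    rows = [key[0] for key, _ in items]
    cols = [key[1] for key, _ in items]
    values = [val for _, val in items]
    return rows, cols, values
```

Only pixels that actually received events are stored, so the cost of building the frame scales with the number of events rather than with the sensor resolution. The three lists are in the shape that common sparse-matrix constructors accept.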


DAISM: Digital Approximate In-SRAM Multiplier-based Accelerator for DNN Training and Inference

arXiv.org Artificial Intelligence

DNNs are one of the most widely used Deep Learning models. The matrix multiplication operations for DNNs incur significant computational costs and are bottlenecked by data movement between the memory and the processing elements. Many specialized accelerators have been proposed to optimize matrix multiplication operations. One popular idea is to use Processing-in-Memory (PIM), where computations are performed by the memory storage element, thereby reducing the overhead of data movement between processor and memory. However, most PIM solutions rely either on novel memory technologies that have yet to mature or on bit-serial computations, which have significant performance overhead and scalability issues. In this work, an in-SRAM digital multiplier is proposed to take the best of both worlds, i.e. performing GEMM in memory but using only conventional SRAMs without the drawbacks of bit-serial computations. This allows the user to design systems with significant performance gains using existing technologies with little to no modifications. We first design a novel approximate bit-parallel multiplier that approximates multiplications with bitwise OR operations by leveraging multiple-wordline activation in the SRAM. We then propose DAISM (Digital Approximate In-SRAM Multiplier architecture), an accelerator for convolutional neural networks, based on our novel multiplier. This is followed by a comprehensive analysis of trade-offs in area, accuracy, and performance. We show that under similar design constraints, DAISM reduces energy consumption by 25% and the number of cycles by 43% compared to state-of-the-art baselines.
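One way to read "approximating multiplications with bitwise OR operations" is sketched below: the shifted partial products of an unsigned multiplication are combined with OR instead of being added. This is an assumption about the general idea, not DAISM's actual circuit, and the function name `or_approx_mul` is hypothetical.

```python
def or_approx_mul(a: int, b: int) -> int:
    """Approximate a * b for non-negative integers.

    Each set bit i of b selects the partial product (a << i). An exact
    multiplier adds these partial products; this sketch ORs them together,
    which drops carries and therefore never exceeds the exact product.
    """
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    result = 0
    shift = 0
    while b:
        if b & 1:
            result |= a << shift
        b >>= 1
        shift += 1
    return result


if __name__ == "__main__":
    for a, b in [(3, 2), (3, 3), (5, 4)]:
        print(a, b, a * b, or_approx_mul(a, b))
```

The result is exact whenever the shifted partial products do not overlap in any bit position (for example, when `b` is a power of two) and is an underestimate otherwise.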


Training a Limited-Interconnect, Synthetic Neural IC

Neural Information Processing Systems

Hardware implementation of neuromorphic algorithms is hampered by high degrees of connectivity. Functionally equivalent feedforward networks may be formed by using limited fan-in nodes and additional layers. No direct mapping of weights exists between fully and limited-interconnect nets. Low-level nonlinearities prevent the formation of internal representations of widely separated spatial features and the use of gradient descent methods to minimize output error is hampered by error magnitude dissipation. The judicious use of linear summations or collection units is proposed as a solution.


Dataflow Architectures: Flexible Platforms for Neural Network Simulation

Neural Information Processing Systems

Dataflow architectures are general computation engines optimized for the execution of fine-grain parallel algorithms. Neural networks can be simulated on these systems with certain advantages. In this paper, we review dataflow architectures, examine neural network simulation performance on a new generation dataflow machine, compare that performance to other simulation alternatives, and discuss the benefits and drawbacks of the dataflow approach. Dataflow architectures are general computation engines that treat each instruction of a program as a separate task which is scheduled in an asynchronous, data-driven fashion. Dataflow programs are compiled into graphs which explicitly describe the data dependencies of the computation.
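The data-driven scheduling described above can be mimicked in a few lines of software. The sketch below is a toy interpreter, not a model of any particular dataflow machine: the graph encoding (`{node: (function, [source names])}`) and the name `run_dataflow` are assumptions made for illustration. A node fires as soon as all of its operands are available, regardless of the order in which the nodes were written down.

```python
import operator


def run_dataflow(graph, inputs):
    """Evaluate a dataflow graph in data-driven order.

    graph:  {node_name: (function, [names of operand nodes or inputs])}
    inputs: {input_name: value}
    Returns a dict with the value of every input and node.
    """
    values = dict(inputs)
    pending = dict(graph)
    while pending:
        # A node is ready when every operand already has a value.
        ready = [name for name, (_, srcs) in pending.items()
                 if all(src in values for src in srcs)]
        if not ready:
            raise ValueError("no node can fire: missing input or cyclic graph")
        for name in ready:  # all ready nodes could fire in parallel
            fn, srcs = pending.pop(name)
            values[name] = fn(*(values[src] for src in srcs))
    return values


# Computes (a + b) * c; "prod" is listed first but fires only after "sum".
graph = {
    "prod": (operator.mul, ["sum", "c"]),
    "sum": (operator.add, ["a", "b"]),
}
result = run_dataflow(graph, {"a": 2, "b": 3, "c": 4})
```

Here the graph itself is the explicit description of data dependencies, and the `ready` list in each round is the set of instructions a dataflow machine would be free to execute concurrently.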


Qualcomm Snapdragon 8 Gen 2 Delivers More AI For Mobile

#artificialintelligence

The Snapdragon Tech Summit is a multi-day event that showcases the latest mobile technology Qualcomm has to offer. This is the second year that Qualcomm has held simultaneous events in China and Hawaii, as well as streaming the keynote addresses. Day 1 of the Snapdragon Tech Summit kicked off with the introduction of the latest system-on-chip (SoC) for smartphones – the Snapdragon 8 Gen 2. As expected, it delivers improvements in performance and efficiency for camera, connectivity, gaming, sound, and security. But the biggest punch comes from the use of artificial intelligence (AI) in just about every area. The company went so far as to call it "purpose built for AI." Qualcomm uses all of the Snapdragon SoC's processing elements for AI processing and calls the combination of these processing elements the "AI engine."


Security and Safety Aspects of AI in Industry Applications

arXiv.org Artificial Intelligence

In this relatively informal discussion-paper we summarise issues in the domains of safety and security in machine learning that will affect industry sectors in the next five to ten years. Various products using neural network classification, most often in vision-related applications but also in predictive maintenance, have been researched and applied in real-world applications in recent years. Nevertheless, reports of underlying problems in both safety- and security-related domains, for instance adversarial attacks, have unsettled early adopters and are threatening to hinder wider-scale adoption of this technology. The problem for real-world applicability lies in being able to assess the risk of applying these technologies. In this discussion-paper we describe the process of arriving at a machine-learnt neural network classifier, pointing out safety and security vulnerabilities in that workflow and citing relevant research where appropriate.